Skip to content

Conversation

dnhatn
Copy link
Member

@dnhatn dnhatn commented Oct 8, 2025

With this change, we pack the dimension into a single value before the second aggregation in time-series queries and unpack it afterward. This avoids generating permutations for multi-valued dimensions in the second aggregation, which is not desirable.

For example, the query

TS k8s | STATS max(rate(request)) BY host, tbucket(1minute)

is rewritten as:

TS k8s
 | STATS rate=rate(request), host=VALUES(host) BY _tsid, tbucket=TBUCKET(1minute)
 | EVAL packed_host=PACK_DIMENSION(host)
 | STATS sum(rate) BY packed_host, tbucket
 | EVAL host=UNPACK_DIMENSION(packed_host)
 | KEEP rate, host, tbucket

There is some overhead with packing and unpacking values, but we tried to isolate this behavior to time-series queries with dimension fields only. That is why I chose this approach.

@dnhatn dnhatn force-pushed the pack-dimensions branch 3 times, most recently from 49bf9c8 to 25b5f7e Compare October 10, 2025 03:51
$if(BytesRef)$
public BytesRef getBytesRef(int position, BytesRef ignore) {
public BytesRef getBytesRef(int position, BytesRef scratch) {
scratch.bytes = value.bytes;
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I will open a separate PR for this fix.

assertThat(Expressions.attribute(rate.field()).name(), equalTo("network.total_bytes_in"));
LastOverTime lastSum = as(Alias.unwrap(aggsByTsid.aggregates().get(1)), LastOverTime.class);
assertThat(Expressions.attribute(lastSum.field()).name(), equalTo("network.cost"));
Values clusterValues = as(Alias.unwrap(aggsByTsid.aggregates().get(3)), Values.class);
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I will update these tests.

return Math.max(INITIAL_SIZE_IN_BYTES, positionCount);
}

static BytesRefBlock packBytesValues(DriverContext driverContext, BytesRefBlock raw) {
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is where we encode multi-valued block into a single value.

}
}

static BytesRefBlock unpackBytesValues(DriverContext driverContext, BytesRefBlock encoded) {
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

And decode here.

@dnhatn
Copy link
Member Author

dnhatn commented Oct 10, 2025

I will also verify this change with the competitive benchmark.

7.203958127639015 | prod | [eu, us] | 2024-05-10T00:10:00.000Z
6.34494062999877 | staging | us | 2024-05-10T00:10:00.000Z
5.700488689624205 | prod | [eu, us] | 2024-05-10T00:20:00.000Z
5.4539153439153445 | prod | [eu, us] | 2024-05-10T00:00:00.000Z
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We return arrays for grouping dimensions.

firstPassAggs.add(newFinalGroup);
var valuesAgg = new Alias(g.source(), g.name(), new Values(g.source(), g));
firstPassAggs.add(valuesAgg);
if (g.isDimension()) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What happens with labels, i.e. grouping attributes that are neither dimensions nor metrics? I think multi-values were not allowed in dimensions for a long time, so this may be the case for some older configs.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok, so we want to pack not only dimensions but also labels. If that's the case, we might need a different approach.

Copy link
Contributor

@kkrik-es kkrik-es left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good, thanks Nhat. Let's check the numbers on competitive benchmark, in case there are big surprises.

@dnhatn dnhatn added :StorageEngine/TSDB You know, for Metrics auto-backport Automatically create backport pull requests when merged labels Oct 13, 2025
@dnhatn dnhatn marked this pull request as ready for review October 13, 2025 23:35
@elasticsearchmachine
Copy link
Collaborator

Pinging @elastic/es-storage-engine (Team:StorageEngine)

@dnhatn
Copy link
Member Author

dnhatn commented Oct 13, 2025

Thanks Kostas!

@dnhatn dnhatn merged commit c8dc04d into elastic:main Oct 13, 2025
34 checks passed
@dnhatn dnhatn deleted the pack-dimensions branch October 13, 2025 23:36
@elasticsearchmachine
Copy link
Collaborator

💔 Backport failed

Status Branch Result
9.2 Commit could not be cherrypicked due to conflicts

You can use sqren/backport to manually backport by running backport --upstream elastic/elasticsearch --pr 136216

@kkrik-es
Copy link
Contributor

💚 All backports created successfully

Status Branch Result
9.2

Questions ?

Please refer to the Backport tool documentation

kkrik-es pushed a commit to kkrik-es/elasticsearch that referenced this pull request Oct 14, 2025
With this change, we pack the dimension into a single value before the
second aggregation in time-series queries and unpack it afterward. This
avoids generating permutations for multi-valued dimensions in the second
aggregation, which is not desirable.

For example, the query

```
TS k8s | STATS max(rate(request)) BY host, tbucket(1minute)
```

is rewritten as:

```esql
TS k8s
 | STATS rate=rate(request), host=VALUES(host) BY _tsid, tbucket=TBUCKET(1minute)
 | EVAL packed_host=PACK_DIMENSION(host)
 | STATS sum(rate) BY packed_host, tbucket
 | EVAL host=UNPACK_DIMENSION(packed_host)
 | KEEP rate, host, tbucket
```

There is some overhead with packing and unpacking values, but we tried
to isolate this behavior to time-series queries with dimension fields
only. That is why I chose this approach.

(cherry picked from commit c8dc04d)

# Conflicts:
#	x-pack/plugin/esql/src/main/java/org/elasticsearch/xpack/esql/action/EsqlCapabilities.java
kkrik-es added a commit that referenced this pull request Oct 14, 2025
With this change, we pack the dimension into a single value before the
second aggregation in time-series queries and unpack it afterward. This
avoids generating permutations for multi-valued dimensions in the second
aggregation, which is not desirable.

For example, the query

```
TS k8s | STATS max(rate(request)) BY host, tbucket(1minute)
```

is rewritten as:

```esql
TS k8s
 | STATS rate=rate(request), host=VALUES(host) BY _tsid, tbucket=TBUCKET(1minute)
 | EVAL packed_host=PACK_DIMENSION(host)
 | STATS sum(rate) BY packed_host, tbucket
 | EVAL host=UNPACK_DIMENSION(packed_host)
 | KEEP rate, host, tbucket
```

There is some overhead with packing and unpacking values, but we tried
to isolate this behavior to time-series queries with dimension fields
only. That is why I chose this approach.

(cherry picked from commit c8dc04d)

# Conflicts:
#	x-pack/plugin/esql/src/main/java/org/elasticsearch/xpack/esql/action/EsqlCapabilities.java

Co-authored-by: Nhat Nguyen <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

auto-backport Automatically create backport pull requests when merged backport pending >non-issue :StorageEngine/TSDB You know, for Metrics Team:StorageEngine v9.2.0 v9.3.0

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants